High volume web services, maxconnections and tcpgoon

dani(dot)caba at gmail(dot)com

Dec 17th, 2017

Who are you?

dc_profile := map[string]string{
  "name": "Daniel Caballero",
  "title": "DevOps Engineer",
  "mail": "dani(dot)caba at gmail(dot)com",
  "company": &SchibstedPT,
  "previously_at": []company{&NTTEurope, &Semantix, &Oracle},
  "linkedin": http.Get("https://www.linkedin.com/in/danicaba"),
  "extra": "Gestión DevOps de Arquitecturas IT@LaSalle",
}

What are you bringing here?

A little bit of context...

What is Schibsted?

And SPT?

And YAMS?

Where did everything start?

2016-06-11 - Incident

Actual root issue

(Easy) fix...

$ cat /etc/security/limits.d/tomcat.conf
tomcat hard nofile 10240
$ cat /etc/tomcat7/server.xml
...
<Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
connectionTimeout="20000"
redirectPort="8443" />
...
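Alongside the file-descriptor limit, the Tomcat Connector itself caps concurrency. A hedged sketch of the relevant attributes (the values here are illustrative, not the ones from the incident):

```xml
<!-- Illustrative values only: maxConnections caps accepted connections,
     maxThreads caps request-processing threads, and acceptCount sizes the
     queue for connections arriving beyond maxConnections -->
<Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000"
           maxConnections="10000"
           maxThreads="200"
           acceptCount="100"
           redirectPort="8443" />
```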

And some testing...

func connection_handler(id int, host string, port int, wg *sync.WaitGroup) {
    fmt.Println("\t runner " + strconv.Itoa(id) + " is initiating a connection")
    conn, err := net.Dial("tcp", host+":"+strconv.Itoa(port))
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    fmt.Println("\t runner " + strconv.Itoa(id) + " established the connection")
    connBuf := bufio.NewReader(conn)
    for {
        str, err := connBuf.ReadString('\n')
        if len(str) > 0 {
            fmt.Println(str)
        }
        if err != nil {
            break
        }
    }
    fmt.Println("\t runner " + strconv.Itoa(id) + " got its connection closed")
    wg.Done()
}

func run_threads(numberConnections int, delay int, host string, port int) {
    runtime.GOMAXPROCS(numberConnections)

    var wg sync.WaitGroup
    wg.Add(numberConnections)

    for runner := 1; runner <= numberConnections; runner++ {
        fmt.Println("Initiating runner # " + strconv.Itoa(runner))
        go connection_handler(runner, host, port, &wg)
        time.Sleep(time.Duration(delay) * time.Millisecond)
        fmt.Println("Runner " + strconv.Itoa(runner) + " initiated. Remaining: " + strconv.Itoa(numberConnections-runner))
    }
    }

    fmt.Println("Waiting runners to finish")
    wg.Wait()
}

func main() {
    hostPtr := flag.String("host", "localhost", "Host you want to open tcp connections against")
    portPtr := flag.Int("port", 8888, "Port you want to open tcp connections against")
    numberConnectionsPtr := flag.Int("connections", 100, "Number of connections you want to open")
    delayPtr := flag.Int("delay", 10, "Number of ms you want to sleep between each connection creation")

    flag.Parse()

    run_threads(*numberConnectionsPtr, *delayPtr, *hostPtr, *portPtr )

    fmt.Println("\nTerminating Program")
}

Executing the tests...

% ./tcpMaxConn -host ec2-54-229-56-140.eu-west-1.compute.amazonaws.com -port 8080 -connections 5 
Initiating runner # 1
         runner 1 is initiating a connection
Runner 1 initiated. Remaining: 4
Initiating runner # 2
         runner 2 is initiating a connection
Runner 2 initiated. Remaining: 3
Initiating runner # 3
         runner 3 is initiating a connection
Runner 3 initiated. Remaining: 2
Initiating runner # 4
         runner 4 is initiating a connection
Runner 4 initiated. Remaining: 1
Initiating runner # 5
         runner 5 is initiating a connection
Runner 5 initiated. Remaining: 0
Waiting runners to finish
         runner 2 established the connection
         runner 1 established the connection
         runner 4 established the connection
         runner 3 established the connection
         runner 5 established the connection
         runner 2 got its connection closed
         runner 1 got its connection closed
         runner 4 got its connection closed
         runner 5 got its connection closed
         runner 3 got its connection closed

Terminating Program

Is this everything?

No...

Supporting a relatively high number of connections in parallel...

...is an easy job...

... but fragile.

Points to bear in mind

  • OS tuning (net.core.somaxconn, ethernet card queues, device backlogs...)
  • Max file descriptor limits for the user running the service
  • In a multiprocess model (legacy?), max process limits for the user running the service
  • Connector/listener settings in your application / application server
  • Associated thread pools / incoming request queues (where applicable)
  • You probably also want pooling/multiplexing against backends
  • Don't forget about other processes using resources
  • And is there a load balancer in front of you? More considerations may apply

If you break a single item, you hit the ground

Plus it may not manifest soon; you only realize when:

  • Lots of ELBs sit in front of you (normally under high load), pre-opening hundreds of connections
  • Or a backend component has issues (slow responses?), so in-flight connections increase drastically

OK, but you are careful, and you review PRs... you are safe

Really?

New incident

Incident 2017-10-31

AND Incident 2017-11-24

WTF

The solution

Continuous testing coverage

Obvious, but...

  • It does not mean building a new feature.
  • Nor building a new service.
  • It does not reduce the service's cost.
  • It does not directly help another team.
  • It does not increase Infra visibility.
  • And the chance of hitting the same issue again "is low".

So, of course, no focus...

Let's use personal OKRs for something fun & usable!

Approach

  • Mission: we want something we can easily plug into our test suite that checks that a single instance of our service supports a specific number of parallel TCP connections
  • Given it requires a deployed version of your application (ideally the same one you will use in production), the acceptance test phase is the target place to plug in this check.
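As a sketch, plugging it into a Travis-style acceptance stage could look like this (the hostname variable, port, and connection threshold are hypothetical, not taken from the real pipeline):

```yaml
# Hypothetical acceptance-test step: the build fails if the deployed instance
# cannot hold 500 parallel TCP connections (tcpgoon exits non-zero on failure)
script:
  - ./tcpgoon "$SERVICE_HOST" 8080 --connections 500 --assume-yes
```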

What does it look like?

% ./tcpgoon --help
tcpgoon tests concurrent connections towards a server listening on a TCP port

Usage:
  tcpgoon [flags] <host> <port>

Flags:
  -y, --assume-yes         Force execution without asking for confirmation
  -c, --connections int    Number of connections you want to open (default 100)
  -d, --dial-timeout int   Connection dialing timeout, in ms (default 5000)
  -h, --help               help for tcpgoon
  -i, --interval int       Interval, in seconds, between stats updates (default 1)
  -s, --sleep int          Time you want to sleep between connections, in ms (default 10)
  -v, --verbose            Print debugging information to the standard error

% ./tcpgoon myhttpsamplehost.com 80 --connections 10 --sleep 999 -y 
Total: 10, Dialing: 0, Established: 0, Closed: 0, Error: 0, NotInitiated: 10
Total: 10, Dialing: 1, Established: 1, Closed: 0, Error: 0, NotInitiated: 8
Total: 10, Dialing: 1, Established: 2, Closed: 0, Error: 0, NotInitiated: 7
Total: 10, Dialing: 1, Established: 3, Closed: 0, Error: 0, NotInitiated: 6
Total: 10, Dialing: 1, Established: 4, Closed: 0, Error: 0, NotInitiated: 5
Total: 10, Dialing: 1, Established: 5, Closed: 0, Error: 0, NotInitiated: 4
Total: 10, Dialing: 1, Established: 6, Closed: 0, Error: 0, NotInitiated: 3
Total: 10, Dialing: 1, Established: 7, Closed: 0, Error: 0, NotInitiated: 2
Total: 10, Dialing: 1, Established: 8, Closed: 0, Error: 0, NotInitiated: 1
Total: 10, Dialing: 1, Established: 9, Closed: 0, Error: 0, NotInitiated: 0
Total: 10, Dialing: 0, Established: 10, Closed: 0, Error: 0, NotInitiated: 0
--- myhttpsamplehost.com:80 tcp test statistics ---
Total: 10, Dialing: 0, Established: 10, Closed: 0, Error: 0, NotInitiated: 0
Response time stats for 10 established connections min/avg/max/dev = 17.929ms/19.814ms/29.811ms/3.353ms
% echo $?
0

And internally?

Let's see some code

func TCPConnect(id int, host string, port int, wg *sync.WaitGroup,
    statusChannel chan<- Connection, closeRequest <-chan bool) error {
...
    for {
        select {
        case <-closeRequest:
            fmt.Fprintln(debugging.DebugOut, "Connection", id, "is being requested to close")
            wg.Done()
            return nil
        default:
            const ReadTimeoutAndBetweenPollsInMs = 1000
            conn.SetReadDeadline(time.Now().Add(time.Duration(ReadTimeoutAndBetweenPollsInMs) * time.Millisecond))
            str, err := connBuf.ReadString('\n')
            if terr, ok := err.(net.Error); ok && terr.Timeout() {
                fmt.Fprintln(debugging.DebugOut, "No info from connection", id, "before timing out. Reading again...")
            } else if err != nil {
                fmt.Fprintln(debugging.DebugOut, "Connection", id, "looks closed. Error", reflect.TypeOf(err), "when reading:")
                fmt.Fprintln(debugging.DebugOut, err)
                connectionDescription.status = ConnectionClosed
                reportConnectionStatus(statusChannel, connectionDescription)
                wg.Done()
                return err
            } else if len(str) > 0 {
                fmt.Fprintln(debugging.DebugOut, "Connection", id, "got", str)
            }
        }
    }
}

func StartBackgroundClosureTrigger(gc GroupOfConnections) <-chan bool {
    closureCh := make(chan bool)

    signalsCh := make(chan os.Signal, 1)
    registerProperSignals(signalsCh)

    go closureMonitor(gc, signalsCh, closureCh)
    return closureCh
}

func registerProperSignals(signalsCh chan os.Signal) {
    signal.Notify(signalsCh, syscall.SIGINT, syscall.SIGTERM)
}

func closureMonitor(gc GroupOfConnections, signalsCh chan os.Signal,
    closureCh chan bool) {
    const pullingPeriodInMs = 500
    for {
        select {
        case signal := <-signalsCh:
            fmt.Fprintln(debugging.DebugOut, "We captured a closure signal:", signal)
            close(closureCh)
            return
        case <-time.After(pullingPeriodInMs * time.Millisecond):
            if !gc.PendingConnections() {
                close(closureCh)
                return
            }
        }
    }
}

Baking

Nothing especially interesting (a Docker wrapper exists so we can run the Travis logic locally):

./_script/test
./_script/formatting_checks

TRAVIS_PULL_REQUEST=${TRAVIS_PULL_REQUEST:-""}
TRAVIS_BRANCH=${TRAVIS_BRANCH:-""}
if [ "$TRAVIS_PULL_REQUEST" == "false" ] && [ "$TRAVIS_BRANCH" = "master" ]
then
    echo "INFO: Merging to master... time to build and deploy redistributables"
    docker_name="dachad/tcpgoon"
    ./_script/build "$docker_name"
    ./_script/deploy "$docker_name"
fi

But...

No, you cannot just move binaries around

function binary_compilation {
    # mix of:
    #  traefik build
    #  http://blog.wrouesnel.com/articles/Totally%20static%20Go%20builds/
    #  https://www.osso.nl/blog/golang-statically-linked/
    #  https://github.com/kubernetes/kubernetes/pull/26028/files
    # see also https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html
    CGO_ENABLED=0 GOOS=linux go build -o out/tcpgoon \
            -ldflags '-extldflags "-static"' -a -installsuffix nocgo -tags netgo
}

Testing...

Are we testing this test? :)

  • A basic tcpserver is included and in use by the project tests.
  • Eureka integration uses dockertest to initialize and shut down a Dockerized Eureka instance.

Q&As

I'd have done the same just with scripting or a fancy tool

Maybe. But goroutines work very well in this scenario. Plus our research did not turn up anything effective for this specific use case.

Where does the project name come from?

Goon: /ɡuːn/ noun informal; noun: goon; plural noun: goons ; 
...
2.
NORTH AMERICAN
a bully or thug, especially a member of an armed or security force.
...

Please, do not read it as "TCP-Go-On". It's awful. Very.

This is a very dangerous tool

Probably. Knives are also dangerous. And you can buy knives. We cannot prevent bad usage.

Worth saying: when executing the tool, an explicit confirmation from the user is required (non-interactive executions require the flag -y/--assume-yes)

I've been taking a look, and your code sucks...

Probably. We do accept PRs :)

How many connections can you open from a single client?

It depends on how many connections your client machine supports :) . No official benchmark/stress test yet, but I have been able to open 5k-10k without problems from my laptop.

Can I use the tool now?

Yes. And a public docker image is available to facilitate the job:

% WHALEBREW_INSTALL_PATH=$HOME/bin whalebrew install dachad/tcpgoon
🐳  Installed dachad/tcpgoon to /home/caba/bin/tcpgoon
% tcpgoon myhttpsamplehost.com 80 -c 2 -y 
Total: 2, Dialing: 0, Established: 0, Closed: 0, Error: 0, NotInitiated: 2
Total: 2, Dialing: 0, Established: 2, Closed: 0, Error: 0, NotInitiated: 0
--- myhttpsamplehost.com:80 tcp test statistics ---
Total: 2, Dialing: 0, Established: 2, Closed: 0, Error: 0, NotInitiated: 0
Response time stats for 2 established connections min/avg/max/dev = 57.606ms/63.499ms/69.391ms/5.892ms
% echo $?
0

What's in your backlog?

See our public issue list on GitHub. In addition to Service Discovery integration (currently Eureka), one of the expected big changes is implementing extra "modes":

  • Incremental mode, and report first failure (stress)
  • Time-restricted execution: a target is defined and execution is successful if it does not time out.

Is this now part of YAMS' acceptance tests?

Not yet. Stress-testing ELBs is not the objective, so Service Discovery integration is required (& ongoing).

Closure...

The gifts

  • When assessing post-mortems, do not stop until the very last root cause is clear
  • One-time solutions suck
  • Golang works well for building low-level utilities
  • Code requires continuous testing. Deliverables too.
  • And Golang is easy & fun :)

Special thanks to...

  • chadell, also owning the project
  • my wife, who created our gopher

Further questions?

Enjoy Christmas!